The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment both scientifically and analytically (a large amount of data is being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the reported number of cases conflicted depending on the source. Originally I was using JHU data… but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: you clicked on the U.S. stats but it didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Mon Jun 8 09:48:44 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ proper, but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Mon Jun  8 09:48:44 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs a case estimate for given log-linear model parameters, slope and intercept
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # log-linear model: log10(total cases) = m*days + b, so cases = 10^(m*days + b)
  total_cases.log<-m*days+b
  total_cases<-10^total_cases.log
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces a standardized data structure by assuming fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the trailing running_avg second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[(t-running_avg+1):t,"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
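As a toy illustration of the stability criterion above (synthetic data and hypothetical names; the real function operates on a data frame, but the windowed second-derivative idea is the same):

```r
# Synthetic log2 case curve: accelerating growth for 10 days, then linear (slope 0.3)
log2_cases <- c(cumsum(seq(0.5, 5, length.out = 10)), 27.5 + 0.3 * (1:20))

d1 <- diff(log2_cases)  # first derivative: day-over-day change in log2 cases
d2 <- diff(d1)          # second derivative: change in the growth rate

# Slide a window of length 5 over |d2|; growth is "linear" once the whole
# window sits within 0.2 of zero (mirroring the cumsum == running_avg check)
window <- 5
idx <- window:length(d2)
stable <- idx[sapply(idx, function(i) all(abs(d2[(i - window + 1):i]) < 0.2))]
day_linear <- min(stable) + 2  # +2 because each diff() consumes one leading day
```

On this toy curve the criterion fires only after the curvature phase ends at day 10, as intended.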

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.7.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.7.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.7.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.7.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other data... Weird. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has an awkward file format. It's wide format, with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis
# Furthermore, the World and US level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # read.csv prefixes numeric column names with "X"; drop it (replacing with "0" would break two-digit months)
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
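A quick check of the date-normalization idea on toy column names (toy values only; this sketch strips the leading X and uses a four-digit year format, rather than the exact calls above):

```r
raw <- c("X1.22.20", "X2.1.20", "X12.31.20")  # read.csv-style JHU date columns
dates <- gsub("^X", "", raw)                  # drop the "X" prefix added by read.csv
dates <- gsub("20$", "2020", dates)           # expand the two-digit year
parsed <- as.Date(dates, format = "%m.%d.%Y")
```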

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
[Table: one row per Country.Region with death and total datapoint counts. Most countries have 138 of each; countries reported at the province level have multiples of 138 (e.g. Australia 1104, Canada 1932, China 4554, Denmark 414, France 1518, Netherlands 690, United Kingdom 1518), and US_state has 450018 of each.]
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute moratlity rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
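The same computation on a toy series, to make the logic concrete (synthetic numbers; column names mirror the working data frame):

```r
toy <- data.frame(Date.numeric = 101:108,
                  Total_confirmed_cases = c(10, 40, 80, 120, 250, 500, 900, 1500))
# First numeric date on which the cumulative total exceeds 100 cases
case100_date <- min(toy$Date.numeric[toy$Total_confirmed_cases > 100])
# Days relative to that date (negative before the 100th case)
toy$Days_since_100 <- toy$Date.numeric - case100_date
```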

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
[Table: Var1 (state/territory) vs Freq, one row each — ranging from 69 (Northern Mariana Islands) to 15333 (Texas); county-level reporting inflates counts for larger states, while territories and cruise ships (e.g. Diamond Princess 83, Guam 84) have the fewest.]
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative_longitudinal_case_trend | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm for each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on same plot - they differ by ~2 logs, so when plotting on the log scale, use pch; when plotting absolute counts, need to use free scales
#when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data
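A minimal sketch of the per-region fit planned above, on synthetic log-linear data (base R instead of ddply; region names and numbers are toy values, but the column names match the working data frame):

```r
# Two toy regions with known log10-case growth: log10(cases) = b + m * days
toy <- data.frame(
  Country.Region = rep(c("RegionA", "RegionB"), each = 30),
  Days_since_100 = rep(0:29, times = 2),
  Total_confirmed_cases.log = c(2 + 0.10 * (0:29), 2 + 0.05 * (0:29))
)
# Fit log10(cases) ~ days within each region and collect (m, b) per region
fits <- do.call(rbind, lapply(split(toy, toy$Country.Region), function(d) {
  fit <- lm(Total_confirmed_cases.log ~ Days_since_100, data = d)
  data.frame(Country.Region = d$Country.Region[1],
             b = unname(coef(fit)[1]),
             m = unname(coef(fit)[2]))
}))
```

The recovered slopes and intercepts could then feed prediction_model for forward projections.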

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
[Table: Country.Region vs n, sorted descending. Top entries: US_state 51614 (all county-level rows pooled), China 138, Diamond Princess 119, Korea, South 109, Japan 108, Italy 106, Iran 103, Singapore 100, France 99, Germany 99, Spain 98, US 97, Switzerland 95, United Kingdom 95; the remaining countries taper down to Suriname 1.]
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 138
Korea, South 109
Japan 108
Italy 106
Iran 103
Singapore 100
France 99
Germany 99
Spain 98
US 97
Switzerland 95
United Kingdom 95
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-06-06        18419
## 2    Afghanistan           <NA> <NA> 2020-05-05        18387
## 3    Afghanistan           <NA> <NA> 2020-02-10        18302
## 4    Afghanistan           <NA> <NA> 2020-02-18        18310
## 5    Afghanistan           <NA> <NA> 2020-02-19        18311
## 6    Afghanistan           <NA> <NA> 2020-02-20        18312
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                    327                 19551     0.01672549
## 2                     95                  3224     0.02946650
## 3                      0                     0            NaN
## 4                      0                     0            NaN
## 5                      0                     0            NaN
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  4.291169                   2.514548        18348
## 2                  3.508395                   1.977724        18348
## 3                      -Inf                       -Inf        18348
## 4                      -Inf                       -Inf        18348
## 5                      -Inf                       -Inf        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             71  NA   NA         NA                           NA
## 2             39  NA   NA         NA                           NA
## 3            -46  NA   NA         NA                           NA
## 4            -38  NA   NA         NA                           NA
## 5            -37  NA   NA         NA                           NA
## 6            -36  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:
1. Find the time at which the cases-vs-date curve becomes linear in each plot
2. Fit a linear model for each state
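The "becomes linear" detection used by find_linear_index() is defined upstream of this chunk. A minimal base-R sketch of the idea (find_linear_start is a hypothetical stand-in, assuming curvature is measured by the discrete second difference):

```r
# Hypothetical stand-in for find_linear_index(): growth is treated as "linear"
# once the discrete second derivative (second difference) of the series is
# near zero.
find_linear_start <- function(cases, tol = 1) {
  d2 <- diff(cases, differences = 2)  # discrete 2nd derivative
  idx <- which(abs(d2) <= tol)[1]     # first near-flat curvature
  if (is.na(idx)) length(cases) else idx
}

# Quadratic ramp-up for 10 days, then linear growth of 100 cases/day:
cases <- c((1:10)^2, 100 + 100 * (1:10))
find_linear_start(cases)
```

The tolerance `tol` is an assumption; on real, noisy counts a smoothed series (or a rolling second difference) would be needed before thresholding.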

# What is the predicted # of cases for the next few days?
# How has the model performed historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 states, total count")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is approximately 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state with fitted linear model, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?
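prediction_model() is defined upstream of this chunk. A minimal sketch consistent with how it is called in the commented code below (slope m and intercept b from the log-linear fit; the parameter values in the example are purely illustrative, not fitted values from this analysis):

```r
# Hypothetical stand-in for prediction_model() (defined upstream): the fit is
# log10(cases) = m * day + b, so predicted raw cases are 10^(m * day + b).
prediction_model <- function(m, b, days) {
  cases.log <- m * days + b
  data.frame(Days_since_100 = days,
             Total_confirmed_cases = 10^cases.log,
             Total_confirmed_cases.log = cases.log)
}

# Illustrative parameters only:
prediction_model(m = 0.05, b = 2, days = c(40, 41, 42))
```

Returning the same columns as the historical data frame is what lets the commented loop rbind() predicted and actual rows for comparison.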

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
# Scratch: toy data with two groups to verify per-group lm fitting
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

# Fit one lm per group (dlply), then collect each model's coefficients (ldply)
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename takes c(old = new): "(Intercept)" -> b, the predictor column -> m
  slope_column<-setdiff(names(case_models.parameters),c(subset_column,"(Intercept)"))
  case_models.parameters<-rename(case_models.parameters,c("(Intercept)"="b",setNames("m",slope_column)))
  return(case_models.parameters)
}

train_lm(tmp,"state",formula_input = y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:
* Retail & recreation
* Grocery & pharmacy
* Parks
* Transit
* Workplace
* Residential

Data from https://www.google.com/covid19/mobility/
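The scraped mobility values arrive as percent strings (e.g. "-45%"). A small helper equivalent to the gsub() conversion used below (pct_to_num is a hypothetical name, not part of the pipeline):

```r
# Strip the trailing "%" and convert to numeric so mobility change can be
# correlated with case counts; fixed = TRUE treats "%" as a literal string.
pct_to_num <- function(x) as.numeric(sub("%", "", x, fixed = TRUE))

pct_to_num(c("-45%", "12%", "0%"))  # -45 12 0
```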

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9987797 -56.0
Hawaii Grocery_Pharmacy 0.9966955 -34.0
New Hampshire Parks 0.9539650 -20.0
Maine Transit -0.9158183 -50.0
Vermont Parks 0.9045661 -35.5
Utah Residential -0.8934336 12.0
Connecticut Grocery_Pharmacy -0.8877720 -6.0
South Dakota Parks 0.8333624 -26.0
Utah Transit -0.8165011 -18.0
Hawaii Parks 0.7930001 -72.0
Wyoming Parks -0.7900176 -4.0
Arizona Grocery_Pharmacy -0.7640402 -15.0
Rhode Island Workplace -0.7539745 -39.5
Alaska Workplace -0.7533751 -33.0
Massachusetts Workplace -0.7528255 -39.0
Connecticut Transit -0.7494888 -50.0
Hawaii Transit 0.7402388 -89.0
Alaska Grocery_Pharmacy -0.7028034 -7.0
Alaska Residential 0.6881999 13.0
Wyoming Transit -0.6848115 -17.0
Hawaii Residential -0.6723441 19.0
Vermont Grocery_Pharmacy -0.6586667 -25.0
New York Workplace -0.6502915 -34.5
Utah Parks -0.6399312 17.0
Rhode Island Retail_Recreation -0.6296900 -45.0
Maine Workplace -0.6135175 -30.0
Rhode Island Residential -0.6121146 18.5
New Jersey Parks -0.6026463 -6.0
North Dakota Parks 0.6016053 -34.0
Utah Workplace -0.5935270 -37.0
New York Retail_Recreation -0.5882068 -46.0
Arizona Residential 0.5862750 13.0
Nebraska Workplace 0.5835230 -32.0
Arizona Retail_Recreation -0.5811152 -42.5
Nevada Transit -0.5651331 -20.0
New Jersey Workplace -0.5474783 -44.0
New York Parks 0.5291141 20.0
Connecticut Residential 0.5139708 14.0
Connecticut Retail_Recreation -0.5125780 -45.0
Idaho Residential -0.5074536 11.0
Massachusetts Retail_Recreation -0.4992466 -44.0
Kansas Parks 0.4975084 72.0
Maine Parks 0.4974867 -31.0
New Jersey Grocery_Pharmacy -0.4838423 2.5
West Virginia Parks 0.4834858 -33.0
Montana Parks -0.4819729 -58.0
Montana Residential 0.4784686 14.0
New Mexico Grocery_Pharmacy -0.4766466 -11.0
Connecticut Workplace -0.4760756 -39.0
Nebraska Residential -0.4702067 14.0
Rhode Island Parks 0.4677024 52.0
Iowa Parks -0.4669151 28.5
North Dakota Retail_Recreation -0.4632702 -42.0
New Mexico Residential 0.4503268 13.5
New Jersey Retail_Recreation -0.4425634 -62.5
Illinois Transit -0.4343728 -31.0
Maryland Workplace -0.4296123 -35.0
Pennsylvania Workplace -0.4251484 -36.0
Wyoming Workplace -0.4244714 -31.0
New Mexico Parks 0.4234997 -31.5
South Carolina Workplace 0.4206427 -30.0
Vermont Residential 0.4133625 11.5
Massachusetts Grocery_Pharmacy -0.4085799 -7.0
New Jersey Transit -0.4077092 -50.5
New Hampshire Residential -0.4032148 14.0
Arizona Transit 0.4011404 -38.0
Alabama Grocery_Pharmacy -0.4004715 -2.0
Maryland Grocery_Pharmacy -0.3941578 -10.0
Hawaii Workplace 0.3930055 -46.0
Pennsylvania Retail_Recreation -0.3816844 -45.0
Alabama Workplace -0.3725514 -29.0
Kentucky Parks -0.3701377 28.5
New York Transit -0.3696281 -48.0
Idaho Grocery_Pharmacy -0.3598035 -5.5
Michigan Parks 0.3567905 30.0
Idaho Workplace -0.3564397 -29.0
Wisconsin Transit -0.3558993 -23.5
California Transit -0.3536259 -42.0
New Mexico Retail_Recreation -0.3530881 -42.5
Nebraska Grocery_Pharmacy 0.3411673 -0.5
California Residential 0.3311491 14.0
Wyoming Grocery_Pharmacy -0.3277523 -10.0
Idaho Transit -0.3272416 -30.0
Virginia Transit -0.3121680 -33.0
Arkansas Retail_Recreation -0.3109784 -30.0
Maryland Retail_Recreation -0.3102442 -39.0
North Dakota Workplace 0.3102277 -40.0
Maine Retail_Recreation -0.3099200 -42.0
California Parks -0.3038636 -38.5
Arkansas Parks 0.3014703 -12.0
Alaska Retail_Recreation 0.3011420 -39.0
Minnesota Transit -0.2980203 -28.5
Nevada Residential 0.2974694 17.0
North Carolina Grocery_Pharmacy 0.2967883 0.0
Illinois Workplace -0.2930472 -30.5
Vermont Retail_Recreation 0.2873268 -57.0
Pennsylvania Parks 0.2857986 12.0
Colorado Residential 0.2831127 14.0
Alabama Transit -0.2813529 -36.5
Mississippi Residential 0.2807612 13.0
Oregon Grocery_Pharmacy 0.2796478 -7.0
Florida Residential 0.2758202 14.0
North Carolina Workplace 0.2751215 -31.0
Montana Transit 0.2716530 -41.0
Rhode Island Transit -0.2692428 -56.0
Texas Workplace 0.2637551 -32.0
Missouri Residential -0.2621029 13.0
Georgia Grocery_Pharmacy -0.2594159 -10.0
Texas Residential -0.2580896 15.0
Rhode Island Grocery_Pharmacy 0.2569036 -7.5
Maryland Residential 0.2547506 15.0
Kansas Workplace 0.2521460 -32.0
West Virginia Grocery_Pharmacy -0.2507033 -6.0
Vermont Workplace -0.2501332 -43.0
Illinois Parks 0.2486778 26.5
Tennessee Workplace -0.2481524 -31.0
Pennsylvania Grocery_Pharmacy -0.2477512 -6.0
South Carolina Parks -0.2424572 -23.0
Florida Parks -0.2352049 -43.0
New York Grocery_Pharmacy -0.2304903 8.0
Tennessee Residential 0.2253071 11.5
Montana Workplace -0.2237643 -40.0
Arkansas Residential 0.2164083 12.0
Nevada Retail_Recreation -0.2160632 -43.0
Wisconsin Parks 0.2140319 51.5
South Dakota Workplace 0.2139432 -35.0
Georgia Workplace -0.2128105 -33.5
Idaho Retail_Recreation -0.2118740 -39.5
North Carolina Transit 0.2069444 -32.0
Illinois Residential 0.2063787 14.0
North Carolina Residential 0.2058386 13.0
North Dakota Grocery_Pharmacy -0.2026946 -8.0
Washington Grocery_Pharmacy 0.2026796 -7.0
Georgia Retail_Recreation -0.2025986 -41.0
Utah Retail_Recreation -0.2018014 -40.0
Michigan Workplace -0.1989224 -40.0
Mississippi Grocery_Pharmacy -0.1977735 -8.0
West Virginia Workplace 0.1958833 -33.0
Minnesota Parks 0.1958571 -9.0
Kansas Grocery_Pharmacy -0.1940832 -14.0
Colorado Parks -0.1897765 2.0
Oregon Transit 0.1846370 -27.5
Virginia Residential 0.1826114 14.0
Texas Parks 0.1814556 -42.0
Oregon Workplace -0.1810867 -31.0
Connecticut Parks 0.1793709 43.0
Wisconsin Residential -0.1758398 14.0
Arizona Parks -0.1718350 -44.5
New Jersey Residential 0.1681892 18.0
Kentucky Grocery_Pharmacy 0.1673212 4.0
Pennsylvania Transit -0.1670868 -42.0
New Mexico Transit 0.1633965 -38.5
Massachusetts Parks 0.1622467 39.0
New Hampshire Retail_Recreation -0.1615445 -41.0
Kentucky Transit -0.1612657 -31.0
Indiana Residential 0.1611476 12.0
Iowa Transit 0.1592645 -24.0
Ohio Transit 0.1567712 -28.0
Nebraska Transit -0.1564719 -9.0
Missouri Workplace 0.1550891 -28.0
California Grocery_Pharmacy -0.1534044 -11.5
Indiana Retail_Recreation 0.1493922 -38.0
South Dakota Residential 0.1449877 15.0
Nevada Workplace -0.1444749 -40.0
Alabama Parks 0.1440359 -1.0
Mississippi Retail_Recreation -0.1428159 -40.0
Indiana Parks -0.1425519 29.0
Florida Retail_Recreation 0.1420354 -43.0
California Retail_Recreation -0.1419812 -44.0
Virginia Grocery_Pharmacy -0.1415873 -8.0
Washington Workplace -0.1385407 -38.0
Michigan Retail_Recreation -0.1355428 -53.0
Mississippi Transit -0.1337954 -38.5
Georgia Residential -0.1300372 13.0
Wisconsin Workplace -0.1288551 -31.0
North Dakota Transit 0.1261843 -48.0
Minnesota Retail_Recreation 0.1261665 -40.0
South Dakota Transit -0.1239492 -40.0
Missouri Transit -0.1227242 -24.5
Texas Grocery_Pharmacy 0.1193315 -14.0
Virginia Parks 0.1152594 6.0
North Dakota Residential -0.1144400 17.0
Wisconsin Grocery_Pharmacy 0.1111403 -1.0
Kentucky Residential 0.1088598 12.0
Massachusetts Transit -0.1079284 -45.0
Wyoming Residential 0.1078763 12.5
Alabama Retail_Recreation 0.1075013 -39.0
New Hampshire Grocery_Pharmacy -0.1073332 -6.0
Idaho Parks 0.1070409 -22.0
Nebraska Retail_Recreation 0.1060610 -36.0
Kansas Transit -0.1049334 -26.5
Washington Residential 0.1045221 13.0
Arkansas Workplace -0.1040690 -26.0
North Carolina Parks -0.1023378 7.0
Texas Transit 0.1001192 -41.0
Maryland Transit -0.0994219 -39.0
Oklahoma Grocery_Pharmacy -0.0983196 -0.5
Massachusetts Residential 0.0982197 15.0
North Carolina Retail_Recreation 0.0977464 -34.0
Nevada Parks 0.0967757 -12.5
Ohio Parks -0.0963543 67.5
Tennessee Parks -0.0963183 10.5
Ohio Residential 0.0958916 14.0
California Workplace -0.0938442 -36.0
New York Residential 0.0921689 17.5
Minnesota Workplace -0.0897790 -33.0
Indiana Workplace 0.0857317 -34.0
Missouri Parks 0.0850320 0.0
Georgia Parks 0.0849068 -6.0
Pennsylvania Residential 0.0843995 15.0
Virginia Workplace -0.0838587 -31.5
Iowa Workplace 0.0826592 -30.0
Oklahoma Workplace 0.0825475 -31.0
Michigan Residential 0.0811007 15.0
Oklahoma Residential 0.0798125 15.0
Oregon Retail_Recreation 0.0797426 -40.5
Utah Grocery_Pharmacy 0.0774166 -4.0
Virginia Retail_Recreation -0.0751987 -35.0
Michigan Transit 0.0736852 -46.0
Ohio Grocery_Pharmacy 0.0734988 0.0
South Carolina Residential -0.0727587 12.0
Maine Residential -0.0726890 11.0
Colorado Transit 0.0725648 -36.0
Minnesota Grocery_Pharmacy 0.0723379 -6.0
Mississippi Parks -0.0717730 -25.0
Minnesota Residential -0.0717034 17.0
Florida Grocery_Pharmacy 0.0716289 -14.0
Washington Transit -0.0705528 -33.5
Oregon Residential 0.0689470 10.5
Kentucky Retail_Recreation 0.0679218 -29.0
Michigan Grocery_Pharmacy -0.0669978 -11.0
Iowa Grocery_Pharmacy -0.0655352 4.0
Texas Retail_Recreation 0.0636444 -40.0
Wyoming Retail_Recreation -0.0631353 -39.0
Indiana Grocery_Pharmacy -0.0610130 -5.5
Mississippi Workplace -0.0605622 -33.0
Alabama Residential -0.0603595 11.0
West Virginia Transit -0.0599922 -45.0
South Carolina Transit 0.0591155 -45.0
South Carolina Grocery_Pharmacy 0.0555480 1.0
Washington Parks 0.0548889 -3.5
Ohio Retail_Recreation 0.0543137 -36.0
West Virginia Residential -0.0517604 11.0
Indiana Transit 0.0481640 -29.0
Kentucky Workplace -0.0475628 -36.5
Illinois Grocery_Pharmacy -0.0446096 2.0
New Hampshire Workplace 0.0441803 -37.0
Arkansas Transit 0.0423112 -27.0
Maine Grocery_Pharmacy -0.0406139 -13.0
Vermont Transit -0.0393083 -63.0
Florida Workplace -0.0343164 -33.0
Ohio Workplace -0.0342325 -35.0
South Dakota Retail_Recreation -0.0316744 -39.0
Wisconsin Retail_Recreation 0.0307900 -44.0
Oregon Parks -0.0282851 16.5
Colorado Grocery_Pharmacy -0.0278303 -17.0
Arizona Workplace -0.0263563 -35.0
Florida Transit -0.0262279 -49.0
Missouri Retail_Recreation -0.0243049 -36.5
Tennessee Grocery_Pharmacy 0.0239802 6.0
Montana Grocery_Pharmacy -0.0231433 -16.0
Oklahoma Parks 0.0219633 -18.5
Illinois Retail_Recreation 0.0217570 -40.0
Colorado Retail_Recreation -0.0200131 -44.0
New Mexico Workplace 0.0194088 -34.0
Georgia Transit -0.0192936 -35.0
Colorado Workplace 0.0166012 -39.0
Iowa Residential -0.0155830 13.0
Arkansas Grocery_Pharmacy -0.0151347 3.0
Tennessee Transit 0.0150550 -32.0
Kansas Residential -0.0148490 13.0
Oklahoma Retail_Recreation 0.0145489 -31.0
Iowa Retail_Recreation 0.0139679 -38.0
Oklahoma Transit 0.0131918 -26.0
West Virginia Retail_Recreation -0.0125923 -38.5
Nebraska Parks 0.0102216 55.5
Missouri Grocery_Pharmacy 0.0096865 2.0
Nevada Grocery_Pharmacy 0.0094419 -12.5
Kansas Retail_Recreation -0.0090109 -38.0
Tennessee Retail_Recreation -0.0088543 -30.0
Maryland Parks 0.0088046 27.0
New Hampshire Transit -0.0070456 -57.0
Washington Retail_Recreation -0.0067322 -42.0
South Carolina Retail_Recreation -0.0063290 -35.0
Montana Retail_Recreation 0.0015504 -51.0
South Dakota Grocery_Pharmacy 0.0011420 -9.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of case load (U.S. data).

See table below for detailed breakdown.

Question | Answer
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the better correlations, but the results are mixed.
What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are roughly log-linear for most countries, with growth beginning to trail off for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall.
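The mortality figure is just a ratio of cumulative counts. A minimal sketch in R, using hypothetical counts for illustration (the actual analysis derives these from the Corona_Cases totals):

```r
# Mortality rate = cumulative deaths / cumulative confirmed (RNA-based) cases
mortality_rate <- function(deaths, confirmed) deaths / confirmed

# Hypothetical counts for illustration only
mortality_rate(30, 1000)  # 0.03, i.e. the ~3% overall rate noted above
```

Note that this is a naive case-fatality ratio: it inherits every caveat about testing volume and case definitions mentioned in the assumptions above.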

END

End: ##—— Mon Jun 8 09:50:17 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap (sandbox/scratch)
# install.packages("maps")
library(maps)
library(dplyr)
library(ggplot2)
# Pennsylvania county outlines (variable name carried over from a Michigan example)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
# Placeholder case counts, one per polygon-vertex row
mi_counties$cases <- seq_len(nrow(mi_counties))
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
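The scratch attempts above pass raw point coordinates to geom_polygon, which needs polygon outlines and a `group` aesthetic. A minimal working choropleth sketch, using hypothetical per-state totals in place of Corona_Cases.US_state:

```r
library(maps)
library(dplyr)
library(ggplot2)

# Hypothetical per-state totals; in practice these would come from
# Corona_Cases.US_state
state_cases <- data.frame(
  region = c("pennsylvania", "new york", "maryland"),
  Total_confirmed_cases = c(60000, 250000, 33000)
)

# map_data("state") gives polygon outlines keyed by lowercase state name
us_states <- map_data("state") %>%
  inner_join(state_cases, by = "region")

ggplot(us_states, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") +
  coord_quickmap() +
  scale_fill_viridis_c(option = "C")
```

The key difference from the attempts above: the coordinates come from map_data() polygon outlines (so `group` identifies each polygon), and the case counts are joined on by state name rather than plotted as raw lat/long points.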


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

# Region list (variable names carried over from an EU map example)
some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# sf data frames support dplyr joins directly
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf before using st_* functions

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net